# Load (and install if missing) the packages used throughout
pacman::p_load(tidyverse, broom, kableExtra, DT, plotly)
set.seed(123)  # make the simulated data reproducible
A/B Testing is a method commonly used in Data Science, Analytics, and Software Engineering to compare two or more options in order to determine which performs better.
Common questions A/B tests are used to answer:
- Which website layout results in more user clicks?
- Does a new drug reduce symptoms more effectively than the current standard?
- Which email subject line drives more opens?
The “A” and “B” refer to the options being compared—such as a control (existing version) and a variation (new version).
1. Define the hypothesis: e.g., "Changing the button color from blue to green increases clicks."
2. Randomly split your audience: half see version A, half see version B.
3. Measure outcomes: clicks, conversions, revenue, etc.
4. Analyze results: decide whether the difference is statistically significant or due to chance.
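The data behind this post isn't shown above, so here is a minimal sketch of how an `ab_data` frame with `group` and `click` columns (the structure the analysis below expects) could be simulated; the 10% and 15% true click rates and the group labels are illustrative assumptions, not the post's actual numbers:

```r
# Simulate an A/B test with a binary click outcome.
# True click rates of 0.10 (blue) and 0.15 (green) are assumptions.
set.seed(123)
n <- 1000  # users per group

ab_data <- data.frame(
  group = rep(c("A_blue", "B_green"), each = n),
  click = c(rbinom(n, 1, 0.10),   # control: blue button
            rbinom(n, 1, 0.15))   # variant: green button
)

# Observed click rate in each group
tapply(ab_data$click, ab_data$group, mean)
```

Randomizing which users land in each group (step 2 above) is what lets us attribute any difference in click rates to the button itself rather than to who happened to see it.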
Because our outcome is binary (clicked or not clicked), we can use a chi-square test (\(\chi^2\)) or a two-proportion z-test. In this example I will use a \(\chi^2\) test.
# Cross-tabulate group vs. click, then test for independence
click_table <- table(ab_data$group, ab_data$click)
chi_result <- chisq.test(click_table)

# Present the tidied test results as a styled HTML table
tidy(chi_result) |>
  kable("html", caption = "Chi-Square Test for Difference in Click Rates") |>
  kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover"))
| statistic | p.value | parameter (df) | method |
|---|---|---|---|
| 30.90473 | < 0.001 | 1 | Pearson’s Chi-squared test with Yates’ continuity correction |
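The two-proportion z-test mentioned earlier would reach the same conclusion. As a sketch, assuming the `ab_data` frame used above (with `group` and `click` columns), `prop.test()` runs that test; with its default Yates continuity correction it is numerically equivalent to `chisq.test()` on the same 2x2 table:

```r
# Two-proportion z-test as an alternative to the chi-square test.
# Assumes ab_data has a binary `click` column and a two-level `group`.
clicks <- tapply(ab_data$click, ab_data$group, sum)     # successes per group
totals <- tapply(ab_data$click, ab_data$group, length)  # trials per group

z_result <- prop.test(clicks, totals)
z_result$p.value   # matches the chi-square p-value (same correction)
z_result$conf.int  # 95% CI for the difference in click rates
```

A bonus of `prop.test()` is the confidence interval: it tells you not just whether the click rates differ, but by roughly how much.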
When we run this experiment, we want to know: did the green button really do better, or could it just be random luck?
The p-value helps answer that:
A p-value of 0.05 means that, if the button color actually made no difference, there would be about a 5% chance (1 in 20) of seeing a gap in clicks at least as large as the one we observed, purely by accident.
In other words, the smaller the p-value, the harder it is to explain the results as random noise.
So if the p-value is less than 0.05, we take that as evidence the difference is probably real: the green button is likely causing more clicks, not just getting lucky.
Important note: it doesn’t mean we’re 100% sure. There’s always a chance we’re wrong. In business and science, a significance level of about 5% is the usual cutoff for saying, “We’re confident enough to believe this result.”
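The "random luck" idea can be made concrete with a quick permutation check. This is a sketch, assuming the `ab_data` frame used above: shuffling the group labels breaks any real link between button color and clicking, so the shuffled gaps show what pure chance produces:

```r
# Permutation illustration of the p-value intuition.
# Assumes ab_data has `group` and `click` columns.
observed_gap <- diff(tapply(ab_data$click, ab_data$group, mean))

perm_gaps <- replicate(2000, {
  shuffled <- sample(ab_data$group)            # randomly reassign labels
  diff(tapply(ab_data$click, shuffled, mean))  # gap under "no effect"
})

# Share of shuffles with a gap at least as extreme as the observed one
mean(abs(perm_gaps) >= abs(observed_gap))
```

If that share is tiny, chance alone rarely produces a gap this big, which is exactly what a small p-value is telling us.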
Thus, we can reject the null hypothesis that the two buttons perform the same. Both the visuals and the test indicate that the green button is outperforming the blue one in clicks.